24 research outputs found

    Term Rewriting on GPUs

    We present a way to implement term rewriting on a GPU. We do this by letting the GPU repeatedly perform a massively parallel evaluation of all subterms. We find that if the term rewrite systems exhibit sufficient internal parallelism, GPU rewriting substantially outperforms the CPU. Since we expect that our implementation can be further optimized, and because in any case GPUs will become much more powerful in the future, this suggests that GPUs are an interesting platform for term rewriting. As term rewriting can be viewed as a universal programming language, this also opens a route towards programming GPUs by term rewriting, especially for irregular computations.
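The core idea of the abstract can be sketched on the CPU: every subterm is matched against the rule set in the same pass, and passes repeat until a fixed point. This is a minimal illustration, not the paper's implementation; the term representation (nested tuples) and the rule set are invented for the example.

```python
# Illustrative rewrite rules keyed on (operator, first argument).
# These rules are hypothetical examples, not taken from the paper.
RULES = {
    ("add", 0): lambda x: x,   # add(0, x) -> x
    ("mul", 0): lambda x: 0,   # mul(0, x) -> 0
    ("mul", 1): lambda x: x,   # mul(1, x) -> x
}

def step(term):
    """Rewrite all subterms once (bottom-up), mimicking one parallel GPU pass."""
    if not isinstance(term, tuple):
        return term
    op, *args = term
    args = [step(a) for a in args]   # on a GPU these would run in parallel
    key = (op, args[0]) if args and not isinstance(args[0], tuple) else None
    rule = RULES.get(key)
    if rule is not None and len(args) == 2:
        return rule(args[1])
    return (op, *args)

def normalize(term, max_steps=100):
    """Repeat parallel passes until no subterm changes, as the GPU loop does."""
    for _ in range(max_steps):
        new = step(term)
        if new == term:
            return new
        term = new
    return term
```

For example, `normalize(("add", 0, ("mul", 1, 5)))` reduces the inner and outer redexes to `5`; a term with many independent redexes is where the massively parallel evaluation pays off.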

    Rocket: Efficient and Scalable All-Pairs Computations on Heterogeneous Platforms

    All-pairs compute problems apply a user-defined function to each combination of two items of a given data set. Although these problems present an abundance of parallelism, data reuse must be exploited to achieve good performance. Several researchers have considered this problem, either resorting to partial replication with static work distribution or to dynamic scheduling with full replication. In contrast, we present a solution that relies on hierarchical multi-level software-based caches to maximize data reuse at each level in the distributed memory hierarchy, combined with a divide-and-conquer approach to exploit data locality, hierarchical work-stealing to dynamically balance the workload, and asynchronous processing to maximize resource utilization. We evaluate our solution using three real-world applications (from digital forensics, localization microscopy, and bioinformatics) on different platforms (from a desktop machine to a supercomputer). Results show excellent efficiency and scalability when scaling to 96 GPUs, even obtaining super-linear speedups due to the distributed cache.
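The data-reuse point can be illustrated with a tiled all-pairs loop: items are processed tile by tile so that each loaded tile is reused against a whole row of tiles before it is evicted. This is a minimal single-machine sketch with an invented signature; the paper's hierarchical caches, work-stealing, and asynchrony are not modeled.

```python
# Tiled all-pairs computation: apply `func` to every (i, j) combination,
# loading items in tiles so the "left" tile is reused across a row of tiles.
def all_pairs(items, func, tile=2):
    n = len(items)
    result = {}
    for i0 in range(0, n, tile):
        left = items[i0:i0 + tile]           # stays resident for the whole row
        for j0 in range(0, n, tile):
            right = items[j0:j0 + tile]      # streamed in tile by tile
            for di, a in enumerate(left):
                for dj, b in enumerate(right):
                    result[(i0 + di, j0 + dj)] = func(a, b)
    return result
```

In a distributed setting the same decomposition lets each level of the memory hierarchy cache a tile once and serve many pair evaluations from it.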

    Semi-Automatic Generation of Assembly Instructions for Open Source Hardware

    Documentation is an essential component of Open Source Hardware (OSH) projects, both for co-development and for replication of designs. However, creating documentation and keeping it up-to-date is often challenging and time-intensive. There are several systems that focus on this documentation challenge, but they are limited in their support for keeping documentation up-to-date and for relating CAD designs to documentation. This article proposes a semi-automated solution that relates the CAD design semantically to a textual specification from which we generate assembly instructions semi-automatically. Our system contains a CAD plugin and a compiler for the textual specification, with which we show that we can replicate a state-of-the-art assembly manual to a high degree, that we can automate significant parts of the documentation process, and that our system can effectively adapt to documentation changes as a result of evolving designs. Our system leads to a methodology that we name “CAD-coupled documentation”, integrating CAD design with the documentation process.

    Distributed Manufacturing: A High-Level Node-Based Concept for Open Source Hardware Production

    Distributed manufacturing is presented as a means to enable sustainable production and collaboration. Rather than relying on centralised production, distributed manufacturing promises to improve the flexibility and resilience needed to meet urgent production demands. New frameworks of production, based on manufacturing models with distributed networks, may provide functional examples to industrial practice. This paper discusses efforts in distributed production in the context of Free/Open source hardware and devises a conceptual framework for future pilots in which open source machines, such as a desktop 3D printer, may be manufactured in a network of open/fab lab nodes.

    Towards an Effective Unified Programming Model for Many-Cores

    Building an effective programming model for many-core processors is challenging. On the one hand, the increasing variety of platforms and their specific programming models forces users to take a hardware-centric approach not only for implementing parallel applications, but also for designing them. This approach diminishes portability and, eventually, limits performance. On the other hand, to effectively cope with the increased number of large-scale workloads that require parallelization, a portable, application-centric programming model is desirable. Such a model enables programmers to focus first on extracting and exploiting parallelism from their applications, as opposed to generating parallelism for specific hardware, and only second on platform-specific implementation and optimizations. In this paper, we first present a survey of programming models designed for programming three families of many-cores: general-purpose many-cores (GPMCs), graphics processing units (GPUs), and the Cell/B.E. We analyze the usability of these models, their ability to improve platform programmability, and the specific features that contribute to this improvement. Next, we discuss two types of generic models: parallelism-centric and application-centric. We also analyze their features and impact on platform programmability. Based on this analysis, we recommend two application-centric models (OmpSs and OpenCL) as promising candidates for a unified programming model for many-cores, and we discuss potential enhancements for them.

    Implementing Stencil Problems in Chapel: An Experience Report

    Stencil operations represent a fundamental class of algorithms in high-performance computing. We are interested in the level of performance that can be expected from a high-productivity language such as Chapel. To this effect, we discuss four different implementations of a generic stencil operation with a convergence check after each iteration. We start with a sequential implementation, followed by a global-view implementation that we experiment with both on a 16-core multi-core system and on a cluster with up to 16 such nodes using domain maps. We finish with a local-view implementation that explicitly encodes all design decisions with respect to parallel execution. This paper is set up as a two-stage experience report: we first report our findings from the user's perspective, without any feedback from the Chapel implementers, and then report additional analysis performed under the guidance of the Chapel team. Our experimental findings show that Chapel performs as expected on a single node. However, it does not achieve the expected levels of performance on our multi-node system, neither with the data-parallel global-view approach nor with the task-parallel local-view code. We discuss the root causes of the reduced performance in detail and report possible solutions.
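The kind of computation studied above can be sketched in a few lines: a Jacobi-style averaging stencil with a convergence check after each iteration. This is a sequential one-dimensional illustration in Python, not the paper's Chapel code; grid, tolerance, and boundary values are chosen for the example.

```python
# 1-D Jacobi stencil: each interior point becomes the average of its
# neighbours; iterate until the largest per-point change drops below `tol`.
def jacobi(grid, tol=1e-4, max_iters=10_000):
    grid = list(grid)
    for it in range(max_iters):
        new = grid[:]                                 # boundaries stay fixed
        delta = 0.0
        for i in range(1, len(grid) - 1):             # interior points only
            new[i] = 0.5 * (grid[i - 1] + grid[i + 1])
            delta = max(delta, abs(new[i] - grid[i]))
        grid = new
        if delta < tol:                               # convergence check
            return grid, it + 1
    return grid, max_iters
```

With boundaries fixed at 0 and 1, the interior converges to a linear ramp. The global-view versions in the paper express the interior update as a data-parallel operation over a distributed domain instead of an explicit loop.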

    Cashmere: Heterogeneous Many-Core Computing

    New generations of many-core hardware become available frequently and are typically attractive extensions for data-centers because of power-consumption and performance benefits. As a result, supercomputers and clusters are becoming heterogeneous and start to contain a variety of many-core devices. Obtaining performance from a homogeneous cluster-computer is already challenging, but achieving it from a heterogeneous cluster is even more demanding. Related work primarily focuses on homogeneous many-core clusters. In this paper we present Cashmere, a programming system for heterogeneous many-core clusters. Cashmere is a tight integration of two existing systems: Satin is a programming system that provides a divide-and-conquer programming model with automatic load balancing and latency hiding, while Many-Core Levels is a programming system that provides a powerful methodology to optimize computational kernels for varying types of many-core hardware. We evaluate our system with several classes of applications and show that Cashmere achieves high performance and good scalability. The efficiency of heterogeneous executions is comparable to that of the homogeneous runs and exceeds 90% in three out of four applications.
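The divide-and-conquer model that Satin provides (and Cashmere inherits) can be sketched as follows. This is an illustrative sequential skeleton with an invented signature; automatic load balancing, latency hiding, and many-core kernel offloading are not modeled.

```python
# Divide-and-conquer skeleton: split a problem range recursively, run small
# leaves as a single computational kernel, and combine the partial results.
def dac(lo, hi, threshold, kernel, combine):
    """Recursively divide the range [lo, hi) until pieces are small enough."""
    if hi - lo <= threshold:
        return kernel(lo, hi)                  # leaf job: run the kernel
    mid = (lo + hi) // 2                       # divide ...
    return combine(dac(lo, mid, threshold, kernel, combine),
                   dac(mid, hi, threshold, kernel, combine))  # ... and conquer
```

For example, `dac(0, 100, 10, lambda a, b: sum(range(a, b)), lambda x, y: x + y)` computes the sum 0..99. In the real systems, the two recursive calls execute asynchronously across the cluster, and the leaf kernels are the parts optimized per many-core device.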

    Optimization Techniques for GPU Programming

    In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing, and they continue to advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, which is not trivial. This survey discusses various optimization techniques found in 450 articles published in the last 14 years. We analyze the optimizations from different perspectives, which shows that the various optimizations are highly interrelated, explaining the need for techniques such as auto-tuning.
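The auto-tuning idea mentioned above can be sketched in a few lines: because optimizations interact, the best parameter setting is found empirically by timing every candidate configuration. This is a minimal exhaustive-search illustration with an invented interface; production tuners use smarter search strategies and run on the actual GPU kernel.

```python
import time
from itertools import product

# Exhaustive auto-tuner: run the kernel once per configuration in the
# parameter space and keep the configuration with the lowest measured time.
def autotune(kernel, space):
    """`space` maps parameter names to lists of candidate values."""
    best, best_time = None, float("inf")
    for values in product(*space.values()):
        params = dict(zip(space.keys(), values))
        start = time.perf_counter()
        kernel(**params)                       # one timed trial per config
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = params, elapsed
    return best
```

A real tuner would average multiple trials per configuration and prune the search space, but the principle is the same: measure, compare, keep the winner.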

    Automatically Inserting Synchronization Statements in Divide-and-Conquer Programs

    Divide-and-conquer is a well-known and important programming model that supports efficient execution of parallel applications on multi-cores, clusters, and grids. In a divide-and-conquer system such as Satin or Cilk, recursive calls are automatically transformed into jobs that execute asynchronously. Since the calls are non-blocking, consecutive calls are the source of parallelism. However, the programmer has to manually enforce synchronization with sync statements that indicate where the system has to wait for the results of the asynchronous jobs. In this paper, we investigate the possibility of automatically inserting sync statements to relieve the programmer of the burden of thinking about synchronization. We investigate whether correctness can be guaranteed and to what extent the amount of parallelism is reduced. We discuss the required code analysis algorithms in detail. To evaluate our approach, we have extended the Satin divide-and-conquer system, which targets efficient execution on grids, with a sync generator. The fact that Satin uses Java as a base language helps the sync generator reason about control flow and aliasing of object references. Our experiments show that, with our analysis, we can automatically generate synchronization points in virtually all real-life cases: in 31 out of 35 real-world applications the sync statements are placed optimally. The automatic placement is correct in all cases, and in one case the sync generator corrected synchronization errors in an application (FFT).
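The essence of the analysis can be shown on a toy straight-line program: a sync must appear before the first use of any result produced by a spawned (asynchronous) call. The statement format below is invented for the illustration; the real system analyzes Java control flow and reference aliasing.

```python
# Scan a straight-line program and insert a sync immediately before the
# first use of any result whose producing job is still outstanding.
def insert_syncs(stmts):
    out, pending = [], set()
    for stmt in stmts:
        kind, var = stmt
        if kind == "spawn":
            pending.add(var)                 # result not yet available
        elif kind == "use" and var in pending:
            out.append(("sync",))            # wait for all outstanding jobs
            pending.clear()                  # after sync, all results exist
        out.append(stmt)
    return out
```

Placing the sync as late as possible, as done here, preserves parallelism: both spawns run concurrently, and a single sync covers all uses that follow. The paper's optimality criterion for real applications is analogous but must also handle branches, loops, and aliased result objects.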